Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add name resolution to threading codemod #71

Merged
merged 7 commits into from
Oct 17, 2023
Merged

Conversation

clavedeluna
Copy link
Contributor

@clavedeluna clavedeluna commented Oct 11, 2023

Overview

Threading codemod that adds a new variable should name the variable to one that's not already in the module scope

Description

  • get all the named vars / attrs / imports in the input code and from there determine if we should name the new lock either lock or lock_cm
  • we find the code module's Global scope to find all the used names. We then pick either lock or lock_cm for the new lock variable, in hopes that either is never used. There may still be a tiny potential for name collusion.
  • Finding the global scope for the node's module is sufficient to get all the used names in the module. I tested this against getting used names for all nodes (function scope, class scope, global scope, etc) and turns out global scope contains everything.

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #71 (800fb77) into main (8128ee6) will increase coverage by 0.08%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   96.13%   96.22%   +0.08%     
==========================================
  Files          46       46              
  Lines        1785     1826      +41     
==========================================
+ Hits         1716     1757      +41     
  Misses         69       69              
Files Coverage Δ
src/codemodder/codemods/utils_mixin.py 90.47% <100.00%> (+3.17%) ⬆️
src/core_codemods/with_threading_lock.py 100.00% <100.00%> (ø)

@clavedeluna clavedeluna marked this pull request as ready for review October 11, 2023 12:24
src/codemodder/codemods/utils_mixin.py Show resolved Hide resolved
src/core_codemods/with_threading_lock.py Outdated Show resolved Hide resolved
@clavedeluna clavedeluna force-pushed the name-resolution branch 2 times, most recently from 9c05109 to b22a4d4 Compare October 12, 2023 11:06
Copy link
Member

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clavedeluna I think this is a really tricky problem so thanks for being patient with the feedback and iterations. I have a few comments that boil down to the following:

  • Can we use more accurate names for semaphores and conditions?
  • Is there a more general iterative solution to avoid name collisions?
  • Are we properly handling multiple changes to the same module?
  • Are we properly handling nested cases?

src/core_codemods/with_threading_lock.py Outdated Show resolved Hide resolved
src/core_codemods/with_threading_lock.py Outdated Show resolved Hide resolved
@@ -86,3 +86,56 @@ def test_no_effect_multiple_with_clauses(self, tmpdir, klass):
...
"""
self.run_and_assert(tmpdir, input_code, input_code)


class TestThreadingNameResolution(BaseSemgrepCodemodTest):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need another test here:

from threading import Lock

with Lock():
    foo()
    
with Lock():
    bar()

This case probably isn't such a concern because variable reuse is probably acceptable. But we also need another test case:

with Lock():
    with Lock():
        foo()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrecsilva wondering if these are the cases you're referring to by suggesting to implement get current names at init? because right now the codemod will change

with Lock():
    with Lock():
        foo()

to

import threading
lock = threading.Lock()
with lock:
    lock = threading.Lock()
    with lock:
        foo()

which is not great since it itself is adding a name clash. This is happening because both calls to leave_With happen one after the other, with the second one not "learning about" the new addition of the first lock = .... I don't believe this would be solved by moving current_names = self.find_used_names_in_module() to codemod init because that too will happen only once.

Copy link
Contributor

@andrecsilva andrecsilva Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was what I was afraid it would happen.
To reiterate, my suggestion is moving the calculation of current_names to WithThreadingLock's __init__ and update current_names with the newly added name within leave_With.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I see what you mean now.

Comment on lines +74 to +96
def find_used_names_in_module(self):
"""
Find all the used names in the scope of a libcst Module.
"""
names = []
scope = self.find_global_scope()
if scope is None:
return [] # pragma: no cover

nodes = [x.node for x in scope.assignments]
for other_nodes in nodes:
visitor = GatherNamesVisitor()
other_nodes.visit(visitor)
names.extend(visitor.names)
return names

def find_global_scope(self):
"""Find the global scope for a libcst Module node."""
scopes = self.context.wrapper.resolve(ScopeProvider).values()
for scope in scopes:
if isinstance(scope, GlobalScope):
return scope
return None # pragma: no cover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it currently is, It's grabbing the names based on the global scope. This means no matter where the Lock() call is detected current_names will always return the same names.

You should calculate the current_names at __init__ and update it accordingly when you add a new name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means no matter where the Lock() call is detected current_names will always return the same names.

this is correct and I believe exactly what we want. When we hit this threading codemod at leave_With, we are hitting a module's with clause. With the current approach, we grab the entire module where the with clause is located and find all the variable / import names. This means we will attempt to never pick a variable name for the lock at any scope, even if we could pick lock bc it's used in some other function. This is the safest thing.

Calculating current_names at init will result in the exact same thing but the downside is that we have to update the codemod architecture. With this approach, all I had to do is to update leave_With and call find_used_names_in_module. This architecture is more generalizable than adding an init call.

If you think I'm mistaken here give me a unit test that would fail the current approach so I can iterate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:

# no conflicts declared at module-level scope
def my_func():
    # Will we correctly detect this conflict?
    lock = "whatever"
    with threading.Lock():
        foo()

I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is it some kind of scope containing all variables declared anywhere in the module?

It is exactly this. I said it somewhere along the lines but it will in fact find all variable / import names. This is good bc it's extra safe, but of course there are scopes in which we could reuse but we don't.

I"ve added the test case that you've pointed out and it passes as expected - we add a new lock_1 variable right under lock = "whatever"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's module-level scope. I believe the current implementation will be able to detect the conflict in your particular example.

@clavedeluna
Copy link
Contributor Author

Alright so here is the current state of this PR and addressing some of Dan's questions

Can we use more accurate names for semaphores and conditions?

I have updated the PR so that variable names would be lock or lock_cm or condition or condition_cm, etc. I still retained the _cm suffix because I see it equivalent to adding a _1 suffix. I can change if you disagree.

Is there a more general iterative solution to avoid name collisions?

The current general solution is you'd be able to call current_names = self.find_used_names_in_module() in any codemod. See caveats below.

Are we properly handling multiple changes to the same module?
Are we properly handling nested cases?

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

So what should we do? I'm sure libcst is smart enough to be able to handle this, but not with the current architecture. The codemod will have to be a bit more complex to handle this relatively rare case. I'd like to propose we accept the current implementation with this edge case as is and come back to it. This is an interesting yet LOW importance codemod, so I'd like to move on to more important ones. The added unit tests currently demonstrate this drawback.

@clavedeluna clavedeluna force-pushed the name-resolution branch 2 times, most recently from 489832f to 330bb6b Compare October 16, 2023 13:07
@andrecsilva
Copy link
Contributor

Are we properly handling multiple changes to the same module?
Are we properly handling nested cases?

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

So what should we do? I'm sure libcst is smart enough to be able to handle this, but not with the current architecture. The codemod will have to be a bit more complex to handle this relatively rare case. I'd like to propose we accept the current implementation with this edge case as is and come back to it. This is an interesting yet LOW importance codemod, so I'd like to move on to more important ones. The added unit tests currently demonstrate this drawback.

In libcst, the tree won't actually change while visiting the nodes. That is, when you return a node within a leave_* method, the changes won't really be "committed" to the tree right away. The tree will only actually transform after the visiting is done (this is the reason things like RemovalSentinel exists). Thus, the metadata is calculated against the original tree when visiting. This is by design and unlikely to change.

An naive solution to that is to update a single call per codemod run. This way the metadata will be accurate after every change. However this will be fairly expensive and a simple solution of keeping track of the newly added names should do.

I can probably scrounge a more complex (and general) solution that is scope aware and can deal with the edge cases within a single codemod call. I don't mind revisiting this later after I'm done with my current tasks.

@clavedeluna
Copy link
Contributor Author

Alright! I believe I've now addressed all issues in this PR.

Copy link
Member

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

Why can't we just keep a list of variable names that we have created and make sure that we don't collide with any of those either? For example, if the codemod adds a variable named lock, can't we just record that somehow and make sure we don't reuse that name in a subsequent change?

@clavedeluna
Copy link
Contributor Author

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

Why can't we just keep a list of variable names that we have created and make sure that we don't collide with any of those either? For example, if the codemod adds a variable named lock, can't we just record that somehow and make sure we don't reuse that name in a subsequent change?

I was able to do just that finally. Sorry for the back and forth, I needed some further explanation on the review from Andre. I now am able to do exactly what we want.

@clavedeluna clavedeluna requested a review from drdavella October 16, 2023 18:40
with lock_1:
lock = threading.Lock()
with lock:
print()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notice this test. libcst processes the internal with statement first, so our naming will be like this. Not a huuuge deal

Copy link
Member

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the continued back and forth but I'm asking for another test case which should help clarify @andrecsilva's question.

Comment on lines +74 to +96
def find_used_names_in_module(self):
"""
Find all the used names in the scope of a libcst Module.
"""
names = []
scope = self.find_global_scope()
if scope is None:
return [] # pragma: no cover

nodes = [x.node for x in scope.assignments]
for other_nodes in nodes:
visitor = GatherNamesVisitor()
other_nodes.visit(visitor)
names.extend(visitor.names)
return names

def find_global_scope(self):
"""Find the global scope for a libcst Module node."""
scopes = self.context.wrapper.resolve(ScopeProvider).values()
for scope in scopes:
if isinstance(scope, GlobalScope):
return scope
return None # pragma: no cover
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:

# no conflicts declared at module-level scope
def my_func():
    # Will we correctly detect this conflict?
    lock = "whatever"
    with threading.Lock():
        foo()

I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.

Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small changes to a test.

tests/codemods/test_with_threading_lock.py Outdated Show resolved Hide resolved
Copy link
Member

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for all of the back-and-forth 😅

@clavedeluna clavedeluna merged commit 8b93bae into main Oct 17, 2023
@clavedeluna clavedeluna deleted the name-resolution branch October 17, 2023 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants